https://github.com/hbhjj/IODS-project
The aim of this course is to get familiarized with open data science. This includes open data and open source tools.
This week I studied doing regression analysis with R in the IODS course. Regression analysis is a method for estimating relationships between variables. In this exercise, we will look at how learning strategies and attitude relate to the course points of one statistics course.
A more detailed description of the dataset can be found at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt
Let's read the data from disk
learning2014 <- read.table(file= 'data/learning2014.txt', header=T)
Dimensions and structure of the data
dim(learning2014)
## [1] 166 7
There are 166 observations and 7 variables in the data.
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.75 2.88 3.88 3.5 3.75 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
The 7 variables in the data frame are gender, Age, attitude, deep, stra, surf and Points.
Let's check the summaries and visualize the data
If they are not already installed, you should run the command install.packages(c("ggplot2", "GGally")) to install the packages required for data visualization
The summaries of the variables
summary(learning2014)
## gender Age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.625 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.500 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.875 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.796 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.250 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.875 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
draw a scatter plot matrix of the variables in learning2014. [-1] excludes the first column (gender)
pairs(learning2014[-1], col=learning2014$gender)
create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
draw the plot
p
It looks like Points correlates most strongly with attitude, stra and surf. We will use these variables later in our regression model. The students tend to be fairly young and there are more females than males, which is not that surprising: these are university students, after all. The deep learning score peaks at the higher end of the scale.
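The correlation claim can also be checked numerically; a quick sketch (assuming learning2014 is loaded as above):

```r
# Correlations of each numeric variable with Points;
# gender (column 1) is excluded as non-numeric
cor(learning2014[-1])[, "Points"]
```

attitude should show the strongest correlation with Points here, with stra and surf next in absolute magnitude.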
Next, we should build a model with the variables described above.
Let's create a regression model with attitude, stra and surf as explanatory variables and Points as the response, and print a summary of it
r_model <- lm(Points ~ attitude + stra + surf, data = learning2014)
summary(r_model)
##
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
stra and surf did not have a statistically significant connection to Points; attitude did. According to the adjusted R-squared value, the model explains around 19% of the variability in Points. A one-point increase in attitude implies a 3.4-point increase in Points, if one believes the model. The intercept is around 11, so when all explanatory variables are 0, the predicted Points are around 11.
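To make the coefficient interpretation concrete, here is a small sketch predicting the Points of a hypothetical student (the attitude, stra and surf values are made up for illustration):

```r
# Hypothetical student: attitude 4, stra 3, surf 3
new_student <- data.frame(attitude = 4, stra = 3, surf = 3)
predict(r_model, newdata = new_student)
# by hand from the coefficients: 11.02 + 3.40*4 + 0.85*3 - 0.59*3 ≈ 25.4
```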
As a final touch, let's do a graphical analysis of the results by plotting Residuals vs Fitted values, a Normal Q-Q plot and Residuals vs Leverage
par(mfrow=c(2,2))
plot(r_model, which= c(1,2,5))
The Residuals vs Fitted plot seems fairly random, so the errors should not depend on the explanatory variables. The Q-Q plot suggests that the errors of the model are fairly normally distributed; there is some deviation at both tails, but it does not seem dramatic. The Residuals vs Leverage plot shows some outliers, but their leverage is not drastically high compared to the other data points.
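The visual diagnostics could be complemented with numeric checks; for example (base R functions, not part of the original exercise):

```r
# Formal normality test of the residuals
shapiro.test(residuals(r_model))
# Most influential observations by Cook's distance
head(sort(cooks.distance(r_model), decreasing = TRUE))
```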
This week, we will explore student alcohol usage. The dataset has, for example, questions related to alcohol usage, social relationships, health and studying. Description of the data can be found here https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION
load dplyr, GGally, tidyr and ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
library(tidyr)
Read the data from the data folder
alc <- read.table(file = 'data/alc.txt', header = T)
glimpse the data
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
The data has 382 observations and 35 variables.
I chose to examine whether the quality of family relationships, family support of education, going out with friends and current health status affect heavy drinking. The rationale behind the first two variables is that it seems plausible that social support makes heavy drinking less probable. Going out with friends offers opportunities to drink, and there might be social pressure to do so. Poor overall health may lead to drinking. Of the four variables, family support is binary; family relationships, going out and health are measured on a five-point Likert scale.
Now, let's take a subset of the data that only has the variables we are interested in and draw bar plots of them.
alc_sub <- alc[,c("famsup","famrel","goout","health", "high_use")]
# use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(alc_sub) %>% glimpse
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Observations: 1,910
## Variables: 2
## $ key <chr> "famsup", "famsup", "famsup", "famsup", "famsup", "famsu...
## $ value <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
# draw a bar plot of each variable
gather(alc_sub) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables; they will
## be dropped
Participants tend to have fairly good family relationships, and most have family support for their studies. Health is skewed towards people feeling very healthy: it is not normally distributed. There are fewer people in the high alcohol usage category. Going out is fairly normally distributed.
Let's explore the chosen variables a bit
# initialize a plot of high_use and family relations
g1 <- ggplot(alc, aes(x = high_use, y = famrel))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("Family relationship")
Good family relationships seem to be connected to not being a high user of alcohol, as hypothesized.
# initialise a plot of high_use and health
g2 <- ggplot(alc, aes(x = high_use, y = health))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("Health")
Maybe a bit surprisingly, health seems to be similar for both high and low users. This might be because participants feel quite healthy overall.
Then, going out and drinking.
# initialise a plot of high_use and going out
g2 <- ggplot(alc, aes(x = high_use, y = goout))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("Going out")
Those going out with friends more tend to drink more. Not that surprising.
Then, let's see how family support and alcohol usage compare by cross-tabulating them.
#create crosstab from high use and famsup
xtabs(~alc$high_use+alc$famsup)
## alc$famsup
## alc$high_use no yes
## FALSE 98 170
## TRUE 46 68
Overall, people who do not use a lot of alcohol and have family support form the biggest group. Interestingly, the proportion of those without family support is larger in the high-use group than in the non-high-use group.
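The proportional comparison is easier to see from column proportions; a sketch:

```r
# Share of high users within each family support group
prop.table(xtabs(~alc$high_use + alc$famsup), margin = 2)
# no support: ~32% high users; with support: ~29%
```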
We will use logistic regression as our method of choice. It is most often used to predict binary variables, such as our high_use. Now we will fit the chosen variables in a regression model and print a summary of it.
alcm <- glm(high_use ~ famrel + famsup + goout + health, data = alc, family = "binomial")
summary(alcm)
##
## Call:
## glm(formula = high_use ~ famrel + famsup + goout + health, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5527 -0.8023 -0.5594 0.9769 2.4393
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.42198 0.69344 -3.493 0.000478 ***
## famrel -0.40343 0.13648 -2.956 0.003117 **
## famsupyes -0.20366 0.25136 -0.810 0.417821
## goout 0.80717 0.11913 6.776 1.24e-11 ***
## health 0.16986 0.09045 1.878 0.060383 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 404.08 on 377 degrees of freedom
## AIC: 414.08
##
## Number of Fisher Scoring iterations: 4
Okay, so only family relationships and going out have statistically significant coefficients. Let's drop everything else, fit the model again with only these two as predictors, and print the coefficients.
alcm2 <- glm(high_use ~ famrel + goout, data = alc, family = "binomial")
summary(alcm2)
##
## Call:
## glm(formula = high_use ~ famrel + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5448 -0.7518 -0.5252 0.9872 2.3530
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0289 0.6148 -3.300 0.000966 ***
## famrel -0.3667 0.1338 -2.740 0.006142 **
## goout 0.7922 0.1183 6.698 2.11e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 408.12 on 379 degrees of freedom
## AIC: 414.12
##
## Number of Fisher Scoring iterations: 4
coef(alcm2)
## (Intercept) famrel goout
## -2.0288785 -0.3666676 0.7921642
As family relationships get poorer, the likelihood of using large amounts of alcohol increases. As a person goes out more, the likelihood of being a high user of alcohol increases.
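A sketch of what the coefficients mean on the probability scale, for hypothetical predictor values chosen here for illustration:

```r
# Predicted probability of high use for someone with the best family
# relationships (famrel = 5) who also goes out the most (goout = 5)
b <- coef(alcm2)
plogis(b["(Intercept)"] + b["famrel"] * 5 + b["goout"] * 5)  # ≈ 0.52
# the same via predict():
predict(alcm2, newdata = data.frame(famrel = 5, goout = 5), type = "response")
```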
Next, let's look at the coefficients as odds ratios, together with their confidence intervals.
OR <- coef(alcm2) %>% exp
CI <- exp(confint(alcm2))
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1314829 0.03816207 0.4283774
## famrel 0.6930400 0.53161279 0.8999454
## goout 2.2081701 1.76229526 2.8045361
Going out seems to increase the tendency for high alcohol consumption quite heavily. As the odds ratio for famrel is less than one, it is negatively associated with high alcohol consumption: better family relationships imply a tendency to drink less. Neither confidence interval includes 1, which would mean equal odds and thus indicate an unreliable predictor.
How well does the model predict? Maybe the following section can answer that.
First, a confusion matrix.
probabilities <- predict(alcm2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 243 25
## TRUE 68 46
The model looks surprisingly good. Let's plot it.
logmodg <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
logmodg + geom_point()
Okay, how good is the model actually? A loss function might tell us. First, though, let's see a table of the prediction results.
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63612565 0.06544503 0.70157068
## TRUE 0.17801047 0.12041885 0.29842932
## Sum 0.81413613 0.18586387 1.00000000
So the model predicted FALSE when the truth was FALSE about 64% of the time, and TRUE when the truth was TRUE about 12% of the time.
# the logistic regression model m and dataset alc with predictions are available
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2434555
So overall, the model is wrong in about 24% of its guesses. This is actually not that bad.
How would this model compare to one where every prediction is the most common category, FALSE? Let's find out by creating a new variable containing only FALSE guesses and feeding it to the loss function.
alc$all_false <- FALSE
loss_func(class = alc$high_use, prob = alc$all_false)
## [1] 0.2984293
It would seem that the model is better than just guessing FALSE all the time, but one can get pretty good results with that guess as well.
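The same baseline loss can be computed directly: guessing FALSE for everyone is wrong exactly for the high users, so its loss equals the share of TRUE values.

```r
# Proportion of high users = loss of the all-FALSE guesser
mean(alc$high_use)  # ≈ 0.298, matching the loss above
```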
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = alcm2, K = 10)
cv$delta[1]
## [1] 0.2617801
Using 10-fold cross-validation, my model performs fairly similarly to the DataCamp model, which had an error rate of around 26%. It might be that going out is simply too good a predictor: for a student population, going out with friends may largely mean drinking with friends, so as such it is not the most useful predictor.
This week we are going to do clustering.
Load MASS, tidyr, dplyr, corrplot, reshape2 and ggplot2 packages
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyr)
library(ggplot2)
library(dplyr)
library(corrplot)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
Load the Boston data, print out structure, summary, matrix plot and correlation plot of the variables.
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# plot matrix of the variables
d <- melt(Boston)
## No id variables; using all as measure variables
ggplot(d,aes(x = value)) +
facet_wrap(~variable,scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pairs(Boston)
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
corrplot(cor_matrix, type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle")
The Boston data has 506 rows and 14 columns. The data is about housing values in Boston and contains variables that may explain some of the variance in the housing values, such as distances to five Boston employment centres and pupil-teacher ratios by town. Many variables have somewhat odd distributions: large spikes combined with otherwise fairly low counts.
As we will be dealing with the crime rate variable, let's discuss it a bit. The crime rate is clearly very high in some areas; crime is not distributed evenly across the city.
It is not surprising that business areas have a higher amount of pollution. Crime rate is negatively correlated with, at least, distance to the employment centres, and positively with access to radial highways. Also, richer areas, where the valuable homes are, seem to have less crime.
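These correlation statements can be verified directly from the rounded correlation matrix computed above; the choice of columns here is mine:

```r
# Correlations of crim with distance to employment centres,
# highway access and median home values
cor_matrix["crim", c("dis", "rad", "medv")]
```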
Explanations of variables, as they are not named intuitively:
crim = per capita crime rate by town
zn = proportion of residential land zoned for lots over 25,000 sq.ft.
indus = proportion of non-retail business acres per town
chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox = nitrogen oxides concentration (parts per 10 million)
rm = average number of rooms per dwelling
age = proportion of owner-occupied units built prior to 1940
dis = weighted mean of distances to five Boston employment centres
rad = index of accessibility to radial highways
tax = full-value property-tax rate per $10,000
ptratio = pupil-teacher ratio by town
black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat = lower status of the population (percent)
medv = median value of owner-occupied homes in $1000s
As we are going to use LDA, which assumes normally distributed variables with a shared covariance structure, we center and scale the data so that each variable has mean 0 and variance 1. Note that scaling only standardizes a variable; it does not change the shape of its distribution.
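scale() simply subtracts each column's mean and divides by its standard deviation; a manual sketch for one variable:

```r
# Manual standardization of crim, equivalent to scale(Boston$crim)
crim_scaled <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(crim_scaled, as.vector(scale(Boston$crim)))  # TRUE
```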
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
Create a categorical variable out of crim and drop the original. LDA predicts a categorical target variable, so we will turn the crime rate into one. At this point, I will create a copy of the standardized data with the original crim for the k-means clustering exercise and name it b_scaled2. This way there is no need to reload and rescale the data later on.
# save the scaled crim as scaled_crim and create copy of scaled data
scaled_crim <- boston_scaled$crim
b_scaled2 <- boston_scaled
# summary of the scaled_crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
# create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low","med_high","high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Create training and test sets out of the scaled data. We will build the model with the training set and later test it with the test set. The crime variable is removed from the test set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
Fit the linear discriminant analysis on the train set. Here I use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. After this, the results are plotted.
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2549505 0.2475248 0.2425743
##
## Group means:
## zn indus chas nox rm
## low 1.0294865 -0.8848340 -0.11943197 -0.8694967 0.4188316
## med_low -0.1038437 -0.2736396 0.03346513 -0.5587237 -0.1666770
## med_high -0.3740445 0.2068438 0.23949396 0.4322225 0.1314594
## high -0.4872402 1.0171960 -0.03128211 1.0419048 -0.3825954
## age dis rad tax ptratio
## low -0.8612033 0.8843075 -0.6930819 -0.7326569 -0.48038143
## med_low -0.3598120 0.3660016 -0.5458997 -0.4746975 -0.07273871
## med_high 0.4042472 -0.4074325 -0.4030438 -0.2768361 -0.33050882
## high 0.7954284 -0.8317151 1.6373367 1.5134896 0.77985517
## black lstat medv
## low 0.37642721 -0.75690457 0.49659041
## med_low 0.30903968 -0.08285939 -0.02098492
## med_high 0.09520435 -0.01878758 0.21421892
## high -0.76663511 0.89203825 -0.70986744
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08515818 0.73007859 -1.00050104
## indus 0.04638581 -0.14693101 0.19178409
## chas -0.07063730 -0.05657489 0.13327652
## nox 0.34895374 -0.85853549 -1.23348278
## rm -0.05825978 -0.08963446 -0.10452381
## age 0.25442308 -0.18764893 -0.22031783
## dis -0.04724213 -0.26103254 0.25222906
## rad 3.14265783 1.08925112 -0.12541914
## tax 0.07404852 -0.23659064 0.55982146
## ptratio 0.07795469 -0.01073463 -0.24982788
## black -0.10256966 0.01139714 0.09753121
## lstat 0.23704453 -0.24043349 0.52142694
## medv 0.18932633 -0.48604046 -0.16318586
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9437 0.0402 0.0161
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)
The plot shows that the index of accessibility to radial highways (rad) has the greatest impact on the classification of high crime rate areas.
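The same can be read off numerically from the fitted object; a sketch:

```r
# Absolute LD1 weights: rad dominates
sort(abs(lda.fit$scaling[, "LD1"]), decreasing = TRUE)[1:3]
# Share of between-group variance captured by each discriminant
# (the "Proportion of trace" in the summary above)
lda.fit$svd^2 / sum(lda.fit$svd^2)
```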
Now I will predict the classes of the test data with the LDA model and cross-tabulate the results with the crime categories of the test set. The correct categories were saved earlier to correct_classes.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 10 1 0
## med_low 5 12 6 0
## med_high 1 12 12 1
## high 0 0 0 29
The model is fairly accurate in its predictions. Noteworthily, it did not make any mistakes with the high crime area predictions. It is more confused with the low and med_low categories.
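A single summary number helps here: overall accuracy is the share of the diagonal in the cross-tabulation. Since the train/test split is random, the exact value varies between runs.

```r
# Overall accuracy of the LDA predictions on the test set
tab <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)  # about 0.65 for the table shown above
```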
Now I will apply k-means clustering to the scaled b_scaled2 data that we saved earlier. I will calculate the distances between the observations and run the k-means algorithm on the dataset. After this, I will investigate the optimal number of clusters and run the algorithm again.
# euclidean distance matrix
dist_eu <- dist(b_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
# manhattan distance matrix
dist_man <- dist(b_scaled2, method="manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4830 12.6100 13.5500 17.7600 48.8600
# k-means clustering
km <-kmeans(dist_eu, centers = 15)
# plot the scaled Boston dataset with clusters
pairs(b_scaled2, col = km$cluster)
#Optimal cluster amount
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:k_max, twcss, type='b')
With 15 clusters, it is quite hard to make any sense of the pairs plot.
After calculating the total within-cluster sum of squares (WCSS) and plotting it, the sharpest drop is between 1 and 2 clusters, so 2 is probably the optimal number of clusters.
# k-means clustering
km <-kmeans(dist_eu, centers = 2)
# plot the scaled Boston dataset with clusters
pairs(b_scaled2, col = km$cluster)
The index of accessibility to radial highways seems to divide these clusters apart. Regarding the crime rate, one cluster contains only low crime rates, while the other contains both high and low. The proportion of non-retail business acres also seems to be a clear divider. It would be interesting to see these clusters plotted on an actual map.